Strip a leading byte order mark before parsing#801
Merged
Conversation
libyaml only discounts the BOM when it detects the stream encoding by itself. Psych passes the encoding explicitly whenever it is known, and on that path libyaml counts the BOM as a first-line character, shifting every token on the first line one column right and silently terminating a block mapping at the second line. #331 Co-Authored-By: Claude Fable 5 <[email protected]>
There was a problem hiding this comment.
Pull request overview
This PR fixes YAML parsing when an input stream begins with a Unicode byte order mark (BOM), aligning behavior with YAML 1.2 by ensuring BOMs don’t shift token columns and break multi-line root-level mappings.
Changes:
- Strip a leading BOM in
Psych::Parser#parsefor UTF-8/UTF-16 inputs before handing data to libyaml. - Add regression tests for
Psych.load,Psych.parse_stream, andPsych::Parser#parsecovering multi-line mappings with BOM (UTF-8, UTF-16, and IO cases).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/psych/parser.rb | Adds Ruby-level BOM stripping before invoking the native libyaml parser. |
| test/psych/test_psych.rb | Adds high-level regression tests for load and parse_stream with a leading BOM. |
| test/psych/test_parser.rb | Adds parser-level BOM regression tests (UTF-8/UTF-16/IO) and a helper for scalar extraction. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+80
to
+83
| if String === yaml | ||
| bom = BOM[yaml.encoding] | ||
| return yaml[1..-1] if bom && yaml.start_with?(bom) | ||
| elsif yaml.respond_to?(:read) && yaml.respond_to?(:external_encoding) && |
If pos succeeded but the later seek failed, the rescue silently discarded the bytes read to check for a BOM. Only the initial pos call is expected to fail, for non-seekable IOs, before anything is consumed. Co-Authored-By: Claude Fable 5 <[email protected]>
The C extension transcodes UTF-32 strings to UTF-8 with the BOM preserved, so they were truncated the same way as UTF-8 input. Co-Authored-By: Claude Fable 5 <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #331.
YAML.load("\uFEFFa: b\nc: d")returns{"a" => "b"}, silently dropping everything after the first newline, andPsych.parse_streamraisesPsych::SyntaxErroron the same input. A BOM at the start of the stream is legal per YAML 1.2 §5.2.Psych tells libyaml the input encoding whenever it is known, so libyaml's reader-level BOM stripping, which only runs during encoding auto-detection, never happens. The scanner skips the BOM instead but counts it as a first-line column, so every token on the first line shifts one column right and a root-level block mapping terminates at the second line. Reported upstream as yaml/libyaml#334.
This strips the BOM at the Ruby level before the input reaches libyaml: the first character of UTF-8/UTF-16 strings, and of seekable IOs whose external encoding is one of those. Binary strings and IOs are left untouched since they go through libyaml's auto-detection, which already handles the BOM correctly. JRuby is unaffected (snakeyaml-engine excludes the BOM from column counting) and the new tests pass on both implementations.
🤖 Generated with Claude Code